feat(sedona-raster-zarr): cloud storage backends (s3, gcs, azure, http) via object_store by james-willis · Pull Request #888 · apache/sedona-db

james-willis · 2026-05-29T06:47:38Z

Summary

Adds cloud storage backends to sedona-raster-zarr. ZarrChunkReader now accepts file://, bare paths, s3://, gs:// / gcs://, az:// / abfs:// / abfss://, and http:// / https://.

import sedonadb
import sedonadb_zarr

con = sedonadb.connect()
con.read_format(
    sedonadb_zarr.ZarrFormatSpec(),
    "s3://its-live-data/datacubes/v2/N40W120/ITS_LIVE_vel_EPSG32610_G0120_X250000_Y5450000.zarr",
).show()

All Async

This package has been migrated to use all async for record loading and byte retrieval. This allows a single approach (cloud was always going to be async, but file system was originally sync) without shimming sync fs code into the async RS_EnsureLoaded udf and async cloud access code into the sync record loading machinery.

Version pinning

zarrs_object_store is pinned to 0.5 in the workspace. 0.6 (the latest) depends on object_store 0.13, which is semver-incompatible with the object_store 0.12.x the rest of the workspace uses through DataFusion 52.

Merge coordination (with #886)

#886 adds the OutDb byte-loading infrastructure (AsyncByteLoader, OutDbLoadRequest, the Zarr ZarrLoader). Whichever of #886/#888 lands second must reconcile:

If Lazy Raster Loading Support for Zarr and GDAL #886 has merged: align the sedona-raster-zarr ZarrLoader onto this PR's all-async storage path (it lands sync in Lazy Raster Loading Support for Zarr and GDAL #886).
If Lazy Raster Loading Support for Zarr and GDAL #886 has merged: wire the session Arc<dyn ObjectStore> carried on OutDbLoadRequest into the Zarr loader's reads, using open_storage_from_uri(uri, store).

…ect_store ZarrChunkReader now accepts file://, bare paths, s3://, gs:// / gcs://, az:// / abfs:// / abfss://, and http:// / https://. The reader's public API is pure-storage-in: hand it an Arc<dyn AsyncReadableListableStorageTraits> and a group URI (for chunk-anchor URI formatting) and it walks the chunk grid. Where the storage came from — registry lookup, env-var-built builder, an in-memory test fixture — is the caller's concern, not the reader's. * `open_storage_from_uri(uri, store_override: Option<Arc<dyn ObjectStore>>)` is the helper for callers that don't already hold a credentialed store. `Some(store)` uses it directly (with PrefixStore for s3/gs/az; http(s) is rooted at the URL by its builder). `None` falls back to per-scheme *Builder::from_env().with_url(uri).build() — the same env-var credential discovery that read_format's orchestrated path gets via ensure_object_store_registered_with_options. file:// and bare paths always go through zarrs_filesystem's FilesystemStore wrapped in SyncToAsyncStorageAdapter, so the local backend shows up on the same async storage surface. * ZarrChunkReader::try_new is async; the sync RecordBatchReader streaming surface is unchanged since next() is pure CPU. * No dedicated tokio runtime; async tasks run on whatever runtime the caller is on (DataFusion's executor for the future SQL UDTF, an ad-hoc current-thread runtime block_on'd in PyZarrChunkReader::new for the Python FFI, #[tokio::test] in tests). * PyZarrChunkReader::new builds storage via open_storage_from_uri (env-var credentials by default), then drives the async try_new through a local current-thread tokio runtime. No synthetic RuntimeEnv, no datafusion-execution dep on the FFI cdylib. * Replaces Group::child_arrays (which opens every child up front and errors hard on any per-array metadata failure) with two purpose- built paths driven by the caller's arrays filter: - arrays: Some([...]) opens each by name with Array::async_open. No listing — usable against backends that can't list (plain HTTPS without WebDAV, S3-via-HttpStore). - arrays: None lists direct children with storage.list_dir, then Array::async_open each. Per-array open failures are logged at warn! and skipped, so a single malformed sibling (e.g. an xarray-style fixed-length-Unicode coord variable with a null fill_value that zarrs 0.23 can't open) no longer poisons the rest of the group. * zarrs_object_store pinned to 0.5 in the workspace; 0.6 depends on object_store 0.13, semver-incompatible with DataFusion 52's object_store 0.12.x. * Cloud smoke tests pass strictly against the public anonymous ITS_LIVE v2 ice-velocity datacubes (s3://its-live-data/...). Same bucket via s3:// (AmazonS3Builder) and via the virtual-hosted HTTPS URL (HttpStore), with an explicit M11/M12 filter to avoid listing on the HTTPS path.

paleolimbot · 2026-06-02T02:34:47Z

+fn default_object_store_for_uri(uri: &str) -> Result<Arc<dyn ObjectStore>, PyErr> {
+    // file:// and bare paths use a LocalFileSystem rooted at `/`;
+    // open_storage_from_uri prefixes it at the group's path. This
+    // matches the store the host's ObjectStoreRegistry yields for
+    // file://, so the loader treats local and cloud uniformly.
+    if uri.starts_with("file://") || !uri.contains("://") {
+        return Ok(Arc::new(object_store::local::LocalFileSystem::new()));
+    }
+    let url = Url::parse(uri)
+        .map_err(|e| PyValueError::new_err(format!("group URI {uri:?} is not a valid URL: {e}")))?;
+    match url.scheme().to_ascii_lowercase().as_str() {


There are some tests in the sedona crate (I think) that smoke test these openers that you can steal. Alternatively, include fewer options since we'll remove some (not sure HTTP applies, unless there's some directory listing stuff I don't know about, and gcs/azure could probably be dropped for now).

This is the temporary code until we pipe the store through. I'm not sure how much testing we want to add just for that

I think you can just remove the untested code here (filesystem and S3 should suffice to actually use this until we pipe through the store)

PyZarrChunkReader::new built a fresh current-thread tokio runtime on every open to bridge the async loader constructor into the sync Python FFI. Replace it with a process-wide OnceLock<Runtime> so the package builds the runtime once and reuses it. next() on the reader is pure CPU and never touches it.

Add the NASA JPL ITS_LIVE project/data URL (https://its-live.jpl.nasa.gov/) next to the cloud smoke-test bucket constants so the dataset the #[ignore] tests hit is self-documenting. Drop the `datafusion` umbrella and `async-trait` dependencies: neither is referenced anywhere in the crate. The only datafusion use is the sedona_internal_datafusion_err! macro, which expands to datafusion_common::DataFusionError and is satisfied by the retained datafusion-common dependency. Shrinks the compile and the wheel.

github-actions Bot requested a review from paleolimbot May 29, 2026 06:48

james-willis marked this pull request as draft May 29, 2026 06:48

james-willis force-pushed the jw/zarr-cloud-reads branch from 5af7c08 to 946e6a4 Compare May 29, 2026 07:16

james-willis mentioned this pull request May 29, 2026

Lazy Raster Loading Support for Zarr and GDAL #886

Open

2 tasks

james-willis force-pushed the jw/zarr-cloud-reads branch 3 times, most recently from 3fbe051 to 6e79a90 Compare June 1, 2026 18:43

james-willis force-pushed the jw/zarr-cloud-reads branch from 6e79a90 to 531283a Compare June 1, 2026 21:32

james-willis marked this pull request as ready for review June 1, 2026 21:39

paleolimbot reviewed Jun 2, 2026

View reviewed changes

james-willis added 2 commits June 2, 2026 09:46

james-willis requested a review from paleolimbot June 2, 2026 17:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(sedona-raster-zarr): cloud storage backends (s3, gcs, azure, http) via object_store#888

feat(sedona-raster-zarr): cloud storage backends (s3, gcs, azure, http) via object_store#888
james-willis wants to merge 3 commits into
apache:mainfrom
james-willis:jw/zarr-cloud-reads

james-willis commented May 29, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

paleolimbot Jun 2, 2026

Uh oh!

james-willis Jun 2, 2026

Uh oh!

paleolimbot Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

james-willis commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

All Async

Version pinning

Merge coordination (with #886)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

paleolimbot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

james-willis Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

paleolimbot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

james-willis commented May 29, 2026 •

edited

Loading